Given a prediction $f$ and grid locations $(g_i)_{i=1}^N$, convert $f$ into "percentiles": $$ p(g) = \frac{1}{N} \sum_{i=1}^N [f(g) \geq f(g_i)] $$ where $[\cdot]$ is the Iverson bracket, and $g$ is a grid location. This construction is closely linked to picking a coverage level.
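A minimal sketch of this transform in numpy; the function name and the convention of passing $f$ already evaluated at the locations of interest and at the grid are our choices, not anything prescribed above:

```python
import numpy as np

def percentile_transform(f_values, f_grid):
    """Empirical "percentile" p(g): the fraction of grid locations g_i with f(g) >= f(g_i).

    f_values : f evaluated at the locations of interest
    f_grid   : f evaluated at the N grid locations g_i
    """
    f_values = np.atleast_1d(np.asarray(f_values, dtype=float))
    f_grid = np.asarray(f_grid, dtype=float)
    # Iverson bracket [f(g) >= f(g_i)], averaged over the N grid locations
    return (f_values[:, None] >= f_grid[None, :]).mean(axis=1)
```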
If we then have actual events $(x_i)_{i=1}^n$, we convert these into a series of values $(a_i)_{i=1}^n \subseteq [0,1]$ by $$ a_i = p(x_i). $$
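Continuing the sketch above, a hypothetical example with synthetic data, assuming each event falls exactly in a grid cell so that $f(x_i)$ is just a lookup (in practice one would evaluate or interpolate $f$ at the event locations):

```python
rng = np.random.default_rng(0)
f_grid = rng.random(100 * 100)                        # f(g_i) on a 100x100 grid, flattened
event_cells = rng.integers(0, f_grid.size, size=20)   # grid cells containing the events x_i
a = percentile_transform(f_grid[event_cells], f_grid)  # a_i = p(x_i)
```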
We now proceed to work with the series $(a_i)$. Source [1] seems to:

- Look at the mean $\mu = \frac1n \sum_{i=1}^n a_i$.
- Given two predictions $f_1, f_2$, form $p_1, p_2$ and hence series $(a_i^{(1)})$ and $(a_i^{(2)})$, then compare these, e.g. by comparing the means, or by looking at $$ \delta = \frac1n \sum_{i=1}^n [a_i^{(1)} > a_i^{(2)}] $$ (see the sketch after this list).
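A small sketch of these summary statistics, assuming we already have the two series $(a_i^{(1)})$ and $(a_i^{(2)})$ evaluated on the same events; the dictionary keys are just illustrative names:

```python
def compare_predictions(a1, a2):
    """Mean of each series, plus delta = fraction of events where prediction 1
    ranks the event higher than prediction 2 does."""
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    return {
        "mu_1": a1.mean(),            # mean of (a_i^(1))
        "mu_2": a2.mean(),            # mean of (a_i^(2))
        "delta": (a1 > a2).mean(),    # fraction of events with a_i^(1) > a_i^(2)
    }
```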
Paper [1] works with these ideas in a hypothesis-testing framework, but it is not clear to us whether the z-test used there is appropriate, given the likely correlation between the different $a_i$.
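We are not certain which z-test [1] actually performs; one natural candidate is a one-sample z-test of $\delta$ against $\tfrac12$, which treats the indicators $[a_i^{(1)} > a_i^{(2)}]$ as i.i.d. Bernoulli($\tfrac12$) under the null. That is exactly the independence assumption questioned above, so the following is only an illustration of the naive calculation, not an endorsement of it:

```python
from scipy import stats

def delta_z_test(a1, a2):
    """Naive two-sided z-test of delta against 1/2, assuming the indicators
    [a_i^(1) > a_i^(2)] are i.i.d. Bernoulli(1/2) under the null."""
    d = (np.asarray(a1) > np.asarray(a2)).astype(float)
    n = d.size
    delta = d.mean()
    se = np.sqrt(0.25 / n)                # standard error under Bernoulli(1/2)
    z = (delta - 0.5) / se
    p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value
    return delta, z, p_value
```

If the $a_i$ are positively correlated, this standard error is too small and the test will be anti-conservative, which is the concern raised above.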